IMPORTING¶

The first thing we do when we start working is import the libraries. You don't need to import every single library you'll use in the notebook right at the beginning; you can start with the bare minimum, which includes:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("C:/Users/91852/Downloads/archive (1)/HR_comma_sep.csv")

The above are enough when you start working on a problem. From then on, you can add further imports at the top or import them wherever you are in the notebook.

Now that we are done with importing and the data has been read in, let's take a look at it.

In [2]:
df
Out[2]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
... ... ... ... ... ... ... ... ... ... ...
14994 0.40 0.57 2 151 3 0 1 0 support low
14995 0.37 0.48 2 160 3 0 1 0 support low
14996 0.37 0.53 2 143 3 0 1 0 support low
14997 0.11 0.96 6 280 4 0 1 0 support low
14998 0.37 0.52 2 158 3 0 1 0 support low

14999 rows × 10 columns

In [22]:
df.head()
Out[22]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [23]:
df.tail()
Out[23]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
14994 0.40 0.57 2 151 3 0 1 0 support low
14995 0.37 0.48 2 160 3 0 1 0 support low
14996 0.37 0.53 2 143 3 0 1 0 support low
14997 0.11 0.96 6 280 4 0 1 0 support low
14998 0.37 0.52 2 158 3 0 1 0 support low
In [24]:
df.count()
Out[24]:
satisfaction_level       14999
last_evaluation          14999
number_project           14999
average_montly_hours     14999
time_spend_company       14999
Work_accident            14999
left                     14999
promotion_last_5years    14999
Department               14999
salary                   14999
dtype: int64
In [25]:
df.shape
Out[25]:
(14999, 10)
In [26]:
df.columns
Out[26]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department', 'salary'],
      dtype='object')
In [27]:
df.groupby("left").mean()
Out[27]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident promotion_last_5years
left
0 0.666810 0.715473 3.786664 199.060203 3.380032 0.175009 0.026251
1 0.440098 0.718113 3.855503 207.419210 3.876505 0.047326 0.005321

The table above shows that employees who left had an average satisfaction of only 44%, worked more hours (207 per month on average), and were rarely promoted in the last 5 years, which helps explain why they left.¶

In [28]:
pd.crosstab(df.salary, df.left).plot(kind='bar')
Out[28]:
<AxesSubplot:xlabel='salary'>

The results above show that employees with high salaries are less likely to leave the company.¶
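To put a number on that, the same crosstab can be normalized per salary band. This is a small sketch on a hypothetical mini-frame (not the notebook's df) that reuses the same column names:

```python
import pandas as pd

# Hypothetical mini-frame with the same column names as the notebook's df
hr = pd.DataFrame({
    "salary": ["low", "low", "low", "medium", "high", "high"],
    "left":   [1, 1, 0, 0, 0, 0],
})

# normalize='index' converts raw counts into per-salary-band leave rates
rates = pd.crosstab(hr.salary, hr.left, normalize="index")
print(rates)
```

Plotting rates instead of raw counts makes the salary effect easier to compare across bands of different sizes.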

In [29]:
dependentDf = df[['satisfaction_level','average_montly_hours','promotion_last_5years','salary']]
dependentDf.head()
Out[29]:
satisfaction_level average_montly_hours promotion_last_5years salary
0 0.38 157 0 low
1 0.80 262 0 medium
2 0.11 272 0 medium
3 0.72 223 0 low
4 0.37 159 0 low

CHECKING FOR THE NEATNESS OF THE DATA¶

Now that the data is loaded, let's take a look at what it contains, i.e., the data types, the number of NaN values, etc. We can do that by using the .info() method.

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

There are no missing values and the dtypes look sensible, which is very good. Let's look at the summary statistics.

In [4]:
df.describe()
Out[4]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

EXPLORATORY DATA ANALYSIS (EDA)¶

Here we visualize the data. This helps us further understand what the data looks like and whether any cleaning or transformation still needs to be done.

In [6]:
import seaborn as sns
In [7]:
sns.set()   #sets the style of the plot.
fig = plt.figure(figsize=(12,6))   #Creates the figure and sets its size
sns.barplot(x='Department', y='satisfaction_level', hue='salary', data=df, ci=None)
plt.title("Satisfaction_level Vs Department", size=15)
plt.show()

The ones with higher salaries are generally more satisfied. But product_mng says "hold my cup": there, employees with low salaries are more satisfied than those with high salaries. The same goes for RandD.¶

In [8]:
fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='last_evaluation', data=df, ci=None)
g.bar_label(g.containers[0])
plt.title("Last Evaluation Vs Department", size=15)
plt.show()

Every department has a similar average evaluation level.¶

In [9]:
fig = plt.figure(figsize=(12,6))
sns.barplot(x='Department', y='number_project', data=df, ci=None)
plt.title("Number Project Vs Department", size=15)
plt.show()
In [10]:
fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='Work_accident', data=df, ci=None)
g.bar_label(g.containers[0])
plt.title("Work Accident Vs Department", size=15)
plt.show()
In [11]:
fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='satisfaction_level', hue='left', data=df, ci=None)
g.bar_label(g.containers[0])
g.bar_label(g.containers[1], rotation=90)
plt.title("Left Vs Department with satisfaction level", size=15)
plt.show()

Line plots¶

In [12]:
fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='time_spend_company', data=df, ci=None, color='r', marker='o')
plt.title("Time spent per each Department", size=15)
plt.show()
In [19]:
fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='average_montly_hours', data=df, ci=None, color='g', marker='*',ms=20)
plt.title("Avg Hours spent in the company per Department", size=15)
plt.show()
In [21]:
fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='promotion_last_5years', data=df, ci=None, color='black', marker='D',ms=15)
plt.title("Promotions of Last 5 years in the company per Department", size=15)
plt.show()
In [30]:
salary_dummies = pd.get_dummies(dependentDf.salary, prefix='salary')
salary_dummies
Out[30]:
salary_high salary_low salary_medium
0 0 1 0
1 0 0 1
2 0 0 1
3 0 1 0
4 0 1 0
... ... ... ...
14994 0 1 0
14995 0 1 0
14996 0 1 0
14997 0 1 0
14998 0 1 0

14999 rows × 3 columns

In [31]:
df_with_dummies = pd.concat([dependentDf,salary_dummies], axis='columns')
In [32]:
df_with_dummies.head()
Out[32]:
satisfaction_level average_montly_hours promotion_last_5years salary salary_high salary_low salary_medium
0 0.38 157 0 low 0 1 0
1 0.80 262 0 medium 0 0 1
2 0.11 272 0 medium 0 0 1
3 0.72 223 0 low 0 1 0
4 0.37 159 0 low 0 1 0
In [33]:
df_with_dummies.drop('salary', axis='columns', inplace=True)
In [34]:
df_with_dummies.head()
Out[34]:
satisfaction_level average_montly_hours promotion_last_5years salary_high salary_low salary_medium
0 0.38 157 0 0 1 0
1 0.80 262 0 0 0 1
2 0.11 272 0 0 0 1
3 0.72 223 0 0 1 0
4 0.37 159 0 0 1 0
In [35]:
y = df.left
y.head()
Out[35]:
0    1
1    1
2    1
3    1
4    1
Name: left, dtype: int64
In [36]:
df['left'].value_counts()
Out[36]:
0    11428
1     3571
Name: left, dtype: int64
In [37]:
sns.pairplot(df)
Out[37]:
<seaborn.axisgrid.PairGrid at 0x223ce4bf3a0>
In [38]:
cols = ['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years']
In [39]:
fig = plt.figure(figsize=(20,12))
corr = df[cols].corr()
sns.heatmap(corr,cbar= True, annot = True, fmt = '.2f', annot_kws = {'size':10}, yticklabels =cols, xticklabels = cols )
plt.show()
In [40]:
#Logistics Regression model
df1 = df[['salary','satisfaction_level',
 'average_montly_hours',
 'promotion_last_5years','left']]
df1
Out[40]:
salary satisfaction_level average_montly_hours promotion_last_5years left
0 low 0.38 157 0 1
1 medium 0.80 262 0 1
2 medium 0.11 272 0 1
3 low 0.72 223 0 1
4 low 0.37 159 0 1
... ... ... ... ... ...
14994 low 0.40 151 0 1
14995 low 0.37 160 0 1
14996 low 0.37 143 0 1
14997 low 0.11 280 0 1
14998 low 0.37 158 0 1

14999 rows × 5 columns

In [41]:
dummies = pd.get_dummies(df1.salary)
dummies
Out[41]:
high low medium
0 0 1 0
1 0 0 1
2 0 0 1
3 0 1 0
4 0 1 0
... ... ... ...
14994 0 1 0
14995 0 1 0
14996 0 1 0
14997 0 1 0
14998 0 1 0

14999 rows × 3 columns

Use pandas.concat() to concatenate two or more DataFrames across rows or columns. Concatenating on rows (the default) creates a new DataFrame containing all rows of the inputs; it essentially appends one DataFrame to another. Concatenating on columns (axis='columns') performs a join on the index. For example: df2 = pd.concat([df, df1]).
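A minimal sketch of both behaviors, using hypothetical two-row frames rather than the notebook's data:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
c = pd.DataFrame({"y": [10, 20]})

# Row-wise concat (the default) stacks b under a;
# ignore_index=True rebuilds a clean 0..n-1 index
rows = pd.concat([a, b], ignore_index=True)

# Column-wise concat joins a and c on their shared index,
# which is how the dummy columns get attached to the frame
cols = pd.concat([a, c], axis="columns")
print(rows.shape, list(cols.columns))
```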

In [42]:
df1 = pd.concat([df1,dummies],axis = 'columns')
df1
Out[42]:
salary satisfaction_level average_montly_hours promotion_last_5years left high low medium
0 low 0.38 157 0 1 0 1 0
1 medium 0.80 262 0 1 0 0 1
2 medium 0.11 272 0 1 0 0 1
3 low 0.72 223 0 1 0 1 0
4 low 0.37 159 0 1 0 1 0
... ... ... ... ... ... ... ... ...
14994 low 0.40 151 0 1 0 1 0
14995 low 0.37 160 0 1 0 1 0
14996 low 0.37 143 0 1 0 1 0
14997 low 0.11 280 0 1 0 1 0
14998 low 0.37 158 0 1 0 1 0

14999 rows × 8 columns

By using the pandas.DataFrame.drop() method you can remove rows or columns from a DataFrame. The axis param specifies which axis to remove along: by default axis=0, meaning rows are removed; use axis=1 (or axis='columns') to remove columns. By default, pandas returns a copy of the DataFrame after deleting; use inplace=True to remove from the existing DataFrame instead.
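A quick sketch of the copy-vs-inplace behavior on a hypothetical frame (the same idea as dropping 'salary' and the redundant 'medium' dummy in the next cell):

```python
import pandas as pd

toy = pd.DataFrame({
    "salary": ["low", "medium", "high"],
    "satisfaction_level": [0.4, 0.6, 0.8],
})

# axis='columns' removes a column; without inplace=True the
# original frame is untouched and a trimmed copy is returned
trimmed = toy.drop(["salary"], axis="columns")
print(list(trimmed.columns), "salary" in toy.columns)
```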

In [43]:
df1 = df1.drop(['salary','medium'],axis='columns')
df1
Out[43]:
satisfaction_level average_montly_hours promotion_last_5years left high low
0 0.38 157 0 1 0 1
1 0.80 262 0 1 0 0
2 0.11 272 0 1 0 0
3 0.72 223 0 1 0 1
4 0.37 159 0 1 0 1
... ... ... ... ... ... ...
14994 0.40 151 0 1 0 1
14995 0.37 160 0 1 0 1
14996 0.37 143 0 1 0 1
14997 0.11 280 0 1 0 1
14998 0.37 158 0 1 0 1

14999 rows × 6 columns

In [44]:
df.describe().T
Out[44]:
count mean std min 25% 50% 75% max
satisfaction_level 14999.0 0.612834 0.248631 0.09 0.44 0.64 0.82 1.0
last_evaluation 14999.0 0.716102 0.171169 0.36 0.56 0.72 0.87 1.0
number_project 14999.0 3.803054 1.232592 2.00 3.00 4.00 5.00 7.0
average_montly_hours 14999.0 201.050337 49.943099 96.00 156.00 200.00 245.00 310.0
time_spend_company 14999.0 3.498233 1.460136 2.00 3.00 3.00 4.00 10.0
Work_accident 14999.0 0.144610 0.351719 0.00 0.00 0.00 0.00 1.0
left 14999.0 0.238083 0.425924 0.00 0.00 0.00 0.00 1.0
promotion_last_5years 14999.0 0.021268 0.144281 0.00 0.00 0.00 0.00 1.0
In [45]:
df.hist(bins=30, figsize=(20,20), color='g');
In [46]:
pd.crosstab(df.salary, df.left).plot(kind='bar', figsize=(15,7))
Out[46]:
<AxesSubplot:xlabel='salary'>
In [47]:
df.groupby('salary').mean()
Out[47]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
salary
high 0.637470 0.704325 3.767179 199.867421 3.692805 0.155214 0.066289 0.058205
low 0.600753 0.717017 3.799891 200.996583 3.438218 0.142154 0.296884 0.009021
medium 0.621817 0.717322 3.813528 201.338349 3.529010 0.145361 0.204313 0.028079

Separating Dependent and Independent Variables¶

With the help of EDA we come to know that this is a classification problem with independent and dependent variables: based on our independent variables we will classify (predict) the output. We assign "X" as the independent variables and "y" as the dependent (target) variable.

In [48]:
X = df.drop(['left'], axis=1)
y = df.left
df.head()
Out[48]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [49]:
y.head()
Out[49]:
0    1
1    1
2    1
3    1
4    1
Name: left, dtype: int64

Data Visualization¶

In [50]:
sns.pairplot(data=df, hue='left')
Out[50]:
<seaborn.axisgrid.PairGrid at 0x223d5603670>
In [51]:
sns.heatmap(X.corr(), 
        xticklabels=X.columns,
        yticklabels=X.columns)
Out[51]:
<AxesSubplot:>

The features with the highest pairwise correlations are 'last_evaluation', 'number_project', 'average_montly_hours', and 'time_spend_company'.

We see from the correlation heatmap that the correlation of the target with all the features is low. Moreover, the pairwise distributions indicate that the linear model might not perform well.
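One way to make that observation numerical is to rank each feature by its absolute correlation with the target. The frame below is a hypothetical stand-in for df[cols]; on the real data, df[cols].corr()['left'] gives the same view:

```python
import pandas as pd

# Hypothetical numeric frame standing in for df[cols]
toy = pd.DataFrame({
    "satisfaction_level":   [0.38, 0.80, 0.11, 0.72, 0.37, 0.90, 0.95],
    "average_montly_hours": [157, 262, 272, 223, 159, 140, 150],
    "left":                 [1, 1, 1, 1, 1, 0, 0],
})

# Absolute correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["left"].drop("left").abs().sort_values(ascending=False)
)
print(corr_with_target)
```

Uniformly small values in this ranking are a warning sign for a purely linear model, which is what the pairwise plots also suggest here.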

In [52]:
df.left.value_counts()
Out[52]:
0    11428
1     3571
Name: left, dtype: int64

Here you can see that, out of roughly 15,000 employees, 3,571 left and 11,428 stayed. The employees who left make up about 23.8% of the workforce.
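That percentage comes straight from the class counts reported above, and value_counts(normalize=True) returns it directly. A sketch reconstructing just those counts:

```python
import pandas as pd

# Reconstruct the class counts reported above: 11428 stayed, 3571 left
left = pd.Series([0] * 11428 + [1] * 3571, name="left")

share = left.value_counts(normalize=True)
print(share)  # class 1 is 3571/14999, i.e. ~23.8% of employees left
```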

In [55]:
sns.factorplot(x='number_project', y='last_evaluation',  data=df)
Out[55]:
<seaborn.axisgrid.FacetGrid at 0x223d4212040>
In [57]:
sns.boxplot(x='number_project', y='satisfaction_level', data=df)
Out[57]:
<AxesSubplot:xlabel='number_project', ylabel='satisfaction_level'>
In [58]:
df
Out[58]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
... ... ... ... ... ... ... ... ... ... ...
14994 0.40 0.57 2 151 3 0 1 0 support low
14995 0.37 0.48 2 160 3 0 1 0 support low
14996 0.37 0.53 2 143 3 0 1 0 support low
14997 0.11 0.96 6 280 4 0 1 0 support low
14998 0.37 0.52 2 158 3 0 1 0 support low

14999 rows × 10 columns

In [59]:
accidentplot = plt.figure(figsize=(10,6))
accidentplotax = accidentplot.add_axes([0, 0, 1, 1])
accidentplotax = sns.violinplot(x='Department', y='average_montly_hours', hue='Work_accident', split=True, data=df)

Modeling¶

Let's prepare the data for modeling, starting with a train/test split.

In [60]:
# We now use model_selection instead of cross_validation
from sklearn.model_selection import train_test_split

X = df.drop('left', axis=1)
y = df['left']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)
In [70]:
df.dtypes
Out[70]:
satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
left                       int64
promotion_last_5years      int64
Department                object
salary                    object
dtype: object
In [72]:
# Employee distribution by department
# Types of colors
color_types = ['#78C850','#F08030','#6890F0','#A8B820','#A8A878','#A040A0','#F8D030',  
                '#E0C068','#EE99AC','#C03028','#F85888','#B8A038','#705898','#98D8D8','#7038F8']

# Count Plot (a.k.a. Bar Plot)
sns.countplot(x='Department', data=df, palette=color_types).set_title('Employee Department Distribution');
 
# Rotate x-labels
plt.xticks(rotation=-45)
Out[72]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 [Text(0, 0, 'sales'),
  Text(1, 0, 'accounting'),
  Text(2, 0, 'hr'),
  Text(3, 0, 'technical'),
  Text(4, 0, 'support'),
  Text(5, 0, 'management'),
  Text(6, 0, 'IT'),
  Text(7, 0, 'product_mng'),
  Text(8, 0, 'marketing'),
  Text(9, 0, 'RandD')])

Split the data into train and test¶

In [73]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1[['satisfaction_level','average_montly_hours','promotion_last_5years','high','low']],df1.left)

LogisticRegression model¶

In [75]:
model = LogisticRegression()
In [76]:
model.fit(X_train,y_train)
Out[76]:
LogisticRegression()
In [77]:
y_pred=model.predict(X_test)
In [78]:
y_pred
Out[78]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
In [79]:
y_test
Out[79]:
8874     0
12046    1
2165     0
14677    1
505      1
        ..
3485     0
2315     0
1284     1
4120     0
7603     0
Name: left, Length: 3750, dtype: int64
In [80]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)
Out[80]:
array([[2639,  168],
       [ 693,  250]], dtype=int64)
In [81]:
model.score(X_test,y_test)
Out[81]:
0.7704
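That score is plain accuracy, and it can be re-derived from the confusion matrix in the previous cell: (2639 + 250) / 3750 = 0.7704. A sketch:

```python
import numpy as np

# Confusion matrix reported above: rows = true class, cols = predicted class
cm = np.array([[2639, 168],
               [693, 250]])

# Accuracy = correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.7704, matching model.score(X_test, y_test)
```

Note that the model catches only 250 of the 943 actual leavers in the test set (recall of roughly 0.27 on the positive class), so on this imbalanced data accuracy alone overstates how well the model identifies employees who leave.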

Conclusion¶

It's definitely worth taking our employees' satisfaction levels more seriously. We've discovered that satisfaction is related to, among other things, salary and the number of projects an employee has. Further study could find an optimal combination of salary, number of projects, and other important factors in taking care of our people, leading to better performance, higher profits, and a lower employee turnover rate. It's also worth noting that time spent at the company and employee evaluations have an important effect on whether employees leave -- this could ultimately be connected to their work, so it's worth investigating in more detail how departments delegate projects to their employees and what kinds of projects they're given, especially given that HR and accounting tend to have higher leave rates than the other functions.
